This report is ETC5521 Assignment 1 by Team Grevillea, comprising Samuel Lyubic, Brendi Ang, Dewi Lestari Amaliah, and Yiwen Jiang.
The Tour de France is an annual cycling race held in France that spans 21 stages over 23 gruelling days. It is considered one of the most prestigious races an elite cyclist can enter, given the difficulty of riding through the varied French landscape (EEB (2020)). Although the general classification is an individual title, riders travel through the country in teams of eight in order to place the team leader in the best position to win it. This report analyses how the Tour de France and the characteristics of its riders have changed over the past hundred years. The analysis was carried out in R using RStudio.
The Tour de France (“Le Tour”) is an annual men’s bicycle race that in its modern form is a 21-day course covering approximately 3,500 km. It is held predominantly in France, while often passing through other countries (Encyclopedia Britannica (2020)). The teams pass through long countryside roads, steep alpine regions and tight city streets, covering terrain that ranges from rolling hills and long flats to steep mountains. These terrain types are spread across the different stages over the 23 days. The rider with the lowest cumulative time across all stages wins the overall title.
Because the race is so long, the athletic ability and race intelligence of the riders are not the only factors in winning; the choice of strategy is also key. In addition, technological change and improved living standards over the past hundred years have brought various changes to the Tour de France. We present our findings through an exploration of the recorded data from 1903 to 2019.
The original source of the Tour de France data is the tdf package written by Rushworth (2020). The data are also available for download from the TidyTuesday GitHub repository by Mock (2020). That repository holds three Tour de France data sets: tdf_winners.csv, stage_data.csv, and tdf_stages.csv. The stage information in tdf_stages.csv only covers editions up to 2017, so we obtained the data for the 2018 and 2019 editions from Wikipedia (Wikipedia contributors 2020) and (Wikipedia contributors 2020).
Figure 1.1: Missing values in the winners data
The tdf_stages data inside the tdf package only covers editions up to 2017, while the other two data sets run all the way to 2019. When using these data for analysis, we therefore had to decide whether to discard the data after 2017 or to obtain the missing stages from another source such as Wikipedia. If the analysis is intended for prediction, recent years are relatively important and it would be unreasonable to discard them. However, when completing the data after 2017 it is necessary to ensure consistency, because the way the data are recorded may vary between sources.

The data set records competition information for the Tour de France and is published on Alastair Rushworth’s GitHub. It covers the race since it was first organised in 1903; 106 editions have been held since then.
The Tour de France is a men’s multi-stage cycling race held annually in France and nearby countries. It was established in 1903 to increase sales of the newspaper L’Auto and has since become an important cultural event for European fans.
The data set includes information on the 106 winners, the stages of each edition, and the riders’ results for each stage. The records run from 1903 to 2019. Alastair Rushworth collected the information from various websites and other sources and integrated it into the data set, which is split into three files provided in .csv format. The variables in each file are listed below.
The tdf_winners data comes from tdf_winners.csv. It contains information about the 106 winners of the Tour de France from 1903 to 2019. A selection of the variables is shown in Table 2.1.

| Variable | Class | Description |
|---|---|---|
| edition | integer | Edition of the Tour de France |
| start_date | double | Start date of the Tour |
| winner_name | character | Winner’s name |
| winner_team | character | Winner’s team (NA if not on a team) |
| distance | double | Distance traveled in KM across the entire race |
| time_overall | double | Time in hours taken by the winner to complete the race |
| time_margin | double | Difference in finishing time between the race winner and the runner up |
| height | double | Height in meters |
| weight | double | Weight in kg |
| age | integer | Age as winner |
| nationality | character | Nationality |
The stage_data data comes from stage_data.csv. It contains ranking information for each stage of the annual race. The variables are shown in Table 2.2.

| Variable | Class | Description |
|---|---|---|
| edition | integer | Race edition |
| year | double | Year of race |
| stage_results_id | character | Stage ID |
| rank | character | Rank of racer for stage |
| time | double | Time of racer |
| rider | character | Rider name |
| age | integer | Age of racer |
| team | character | Team (NA if not on team) |
| points | integer | Points for the stage |
| elapsed | double | Time elapsed stored as lubridate::period |
| bib_number | integer | Bib number |
The tdf_stages data comes from tdf_stages.csv. It contains information about each stage of the annual race. The variables are shown in Table 2.3.

| Variable | Class | Description |
|---|---|---|
| Stage | character | Stage Number |
| Date | double | Date of stage |
| Distance | double | Distance in KM |
| Origin | character | Origin city |
| Destination | character | Destination city |
| Type | character | Stage Type |
| Winner | character | Winner of the stage |
| Winner_Country | character | Winner’s nationality |
The data set is primarily used to analyse how the Tour de France has changed over more than a hundred years. The primary question we aim to answer is how the riders have performed. After an overview of the primary question, we conduct more specific analyses through a set of secondary questions.
The data come from Alastair Rushworth’s tdf data package. The package contains information about the overall winning rider of each edition of the race, the winner’s biographical information, and the results for each stage of each edition. To install the package, use install_github("alastairrushworth/tdf").
- The winners data can be taken from editions in the tdf package; we only need to drop the stage_results variable.
- The stage_data data set is also imported from tdf::editions, where the stage information is nested in the stage_results variable. We use the unnest_longer() function to rectangle the nested stage data into a tidy tibble, and then the flatten_df() function to flatten the resulting list of lists into a data frame. This step is essential because the stage data are nested inside the editions data and cannot be read from it directly. Finally, we select the relevant variables and use the year() function from the lubridate package to extract the year of each race.
- Reading tdf_stages directly from the tdf package would be the most convenient option, but the package only provides data up to the 2017 edition, while the actual data set runs all the way to 2019. We therefore opted to use the cleaning script found in the GitHub repository. Our debugging of that script included renaming inconsistent variable names, extracting components of date-time objects, and using regular expressions to fix the structure of character strings, so as to reproduce the intended data set. In addition, we scraped Wikipedia (Wikipedia contributors (2020) and Wikipedia contributors (2020)) to obtain the stages data for 2018 and 2019. These Wikipedia pages correspond to the source from which the actual data set obtained its data, so binding the data was straightforward because their structure was analogous to that of the tdf_stages data set.
Part of the data cleaning code:
    library(tidyverse)
    library(tdf) # install at: https://github.com/alastairrushworth/tdf

    # Winners: one row per edition, without the nested stage results
    winners <- tdf::editions %>%
      select(-stage_results)

    # Unnest the stage results; rank is coerced to character first because it
    # mixes finishing positions with status codes such as "DNF"
    all_years <- tdf::editions %>%
      unnest_longer(stage_results) %>%
      mutate(stage_results = map(stage_results, ~ mutate(.x, rank = as.character(rank)))) %>%
      unnest_longer(stage_results)

    # Flatten the remaining list-column into a data frame
    stage_all <- all_years %>%
      select(stage_results) %>%
      flatten_df()

    combo_df <- bind_cols(all_years, stage_all) %>%
      select(-stage_results)

    # Keep the stage-level columns and derive the race year
    stage_clean <- combo_df %>%
      select(edition, start_date, stage_results_id:last_col()) %>%
      mutate(year = lubridate::year(start_date)) %>%
      rename(age = age...25) %>%
      select(edition, year, everything(), -start_date)

    # Write both data sets to disk
    winners %>%
      write_csv(here::here("2020", "2020-04-07", "tdf_winners.csv"))

    stage_clean %>%
      write_csv(here::here("2020", "2020-04-07", "stage_data.csv"))
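The last bullet above describes fixing inconsistent names and strings in the scraped 2018 and 2019 stages before binding them to tdf_stages. A minimal sketch of that harmonise-then-bind step is given here in base R. The rows are placeholders, and the `stage_no` and `km` column names and the `harmonise_stages()` helper are hypothetical, not the names used in our actual scraping script.

```r
# Placeholder stand-in for the existing stages data (illustrative rows only).
tdf_stages_old <- data.frame(
  Stage    = c("1", "2"),
  Date     = as.Date(c("2017-07-01", "2017-07-02")),
  Distance = c(14.0, 203.5),
  Type     = c("Individual time trial", "Flat stage"),
  stringsAsFactors = FALSE
)

# Placeholder stand-in for a scraped table: inconsistent column names and
# distances stored as text such as "201 km".
scraped <- data.frame(
  stage_no = c("Stage 1", "Stage 2"),
  Date     = as.Date(c("2018-07-07", "2018-07-08")),
  km       = c("201 km", "182.5 km"),
  Type     = c("Flat stage", "Flat stage"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename columns and clean strings with regular
# expressions so the scraped table matches the layout of the existing data.
harmonise_stages <- function(df) {
  data.frame(
    Stage    = sub("^Stage\\s+", "", df$stage_no),
    Date     = df$Date,
    Distance = as.numeric(sub("\\s*km$", "", df$km)),
    Type     = df$Type,
    stringsAsFactors = FALSE
  )
}

# Bind the two tables and extract the year component of the date.
stages_all <- rbind(tdf_stages_old, harmonise_stages(scraped))
stages_all$year <- as.integer(format(stages_all$Date, "%Y"))
```

Because `rbind()` matches data frame columns by name, the binding only works once the names agree, which is why the renaming step comes first.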
Following this overview of the Tour de France, we begin by examining the performance of the winners and of the riders more generally. We use the winners data to visualise how many times each rider has won the race, and the stage_clean data to see how riders perform in each stage.
Figure 3.1: The number of times each rider has won the Tour de France
Figure 3.1 ranks riders by the number of times they have won the Tour de France. Lance Armstrong won the race seven times in total, which is more than most other riders. This raises an interesting question: does the winner achieve a high rank or an extraordinary performance in most of the stages?
Figure 3.2: The number of times each rider has won the Tour de France
The animation displays the cumulative points of riders in 2019 (the higher the rank, the more points) and shows only the top 15 riders by accumulated points. The red bar is the 2019 winner, Egan Bernal. It is easy to see that the winner does not need to achieve very good results in every stage: the winner’s cumulative points are only half those of the riders with the highest totals.
These figures point us towards a more in-depth exploration of the data. What are the characteristics of Tour de France winners? How have the distance and speed of the Tour de France changed? Is the Tour de France becoming more competitive?
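The running totals behind such an animation can be sketched in base R as a cumulative sum of stage points within each rider. The data frame below is made up (three hypothetical riders over three stages) and is not the 2019 data.

```r
# Made-up stage points for three hypothetical riders (not real 2019 results).
stage_points <- data.frame(
  stage  = rep(1:3, each = 3),
  rider  = rep(c("Rider A", "Rider B", "Rider C"), times = 3),
  points = c(50, 30, 0,  20, 0, 10,  50, 0, 5),
  stringsAsFactors = FALSE
)

# Order by stage so the running total follows race order, treat missing
# points as zero, then accumulate points within each rider.
stage_points <- stage_points[order(stage_points$stage), ]
stage_points$points[is.na(stage_points$points)] <- 0
stage_points$cum_points <- ave(stage_points$points, stage_points$rider,
                               FUN = cumsum)

# Standings after the last stage, highest cumulative points first.
final <- stage_points[stage_points$stage == max(stage_points$stage), ]
final <- final[order(-final$cum_points), c("rider", "cum_points")]
```

Treating missing points as zero is an assumption of this sketch; it keeps a single missing value from turning the rest of a rider's running total into NA.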
In sport, an athlete’s physical fitness is a core factor affecting the outcome. This is especially true of the Tour de France, where the 23-day race is a severe test of physical fitness.
BMI is commonly used as an indicator of a person’s physical health. For example, a low BMI may be associated with a weakened immune system and/or weakened bones. We combine the athletes’ height and weight variables into BMI using the following formula:
\(BMI = \frac{Weight(kg)}{(Height(m))^2}\)
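As a minimal sketch, the formula can be applied to the height (metres) and weight (kilograms) columns described in Table 2.1. The rows below are placeholders for hypothetical riders, not values from the winners data.

```r
# Placeholder heights and weights for hypothetical riders.
winners_demo <- data.frame(
  winner_name = c("Rider A", "Rider B", "Rider C"),
  height = c(1.80, 1.75, NA),  # metres
  weight = c(68, 64, 70)       # kilograms
)

# BMI = weight (kg) / height (m)^2; a missing input gives a missing BMI.
winners_demo$bmi <- winners_demo$weight / winners_demo$height^2

# Flag values inside the 19 to 25 reference band discussed in the text.
winners_demo$in_band <- winners_demo$bmi >= 19 & winners_demo$bmi <= 25
```

Winners with a missing height or weight keep a BMI of NA, so they would simply not appear in a plot of BMI against year.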
Figure 3.3: BMI of previous Tour de France winners
Figure 3.3 shows the BMI of the Tour de France winner in each year. The winners’ BMI values are mostly concentrated in a narrow range. The red dashed lines mark the range of appropriate BMI for adult males, which is between 19 and 25. In recent years the winners’ BMI shows a slight downward trend, but the trend is not pronounced.
Figure 3.4: The age of the previous winners of the Tour de France
Age is often a major factor in assessing an athlete’s potential performance. The plots are presented in Figure 3.4. The left panel displays the trend in the average age of the winners, averaged over each decade. Although the average age has been highly volatile over the hundred years, after 1980 it has gradually increased, reaching 29 over the past ten years. The right panel shows the age distribution of all winners. The winners are mainly between 20 and 35 years old, with an average of about 27. Compared with other competitions, the winners of the Tour de France are relatively old.
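The decade averaging used for the left panel can be sketched in base R as follows. The years and ages are placeholders, and the `decade` column is a name introduced here for illustration.

```r
# Placeholder years and ages (not taken from the winners data).
ages_demo <- data.frame(
  year = c(1990, 1995, 2001, 2009, 2015),
  age  = c(26, 28, 30, 32, 29)
)

# Integer division maps each year to the start of its decade, e.g. 1995 -> 1990.
ages_demo$decade <- (ages_demo$year %/% 10) * 10

# Mean age of the winners within each decade.
age_by_decade <- aggregate(age ~ decade, data = ages_demo, FUN = mean)
```

The same grouping applies to any other per-edition variable that is summarised by decade, such as the winner's total time or the time margin.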
Figure 3.5: The number of Tour de France wins by country
Figure 3.6: The number of Tour de France wins by team
Comparing the number of winners between countries (Figure 3.5), France far exceeds second-placed Belgium. Figure 3.6 compares the number of winners between teams: the French team has four more winners than Alcyon-Dunlop, which ranks second.
For both countries and teams, the first-ranked entry has far more winners than those that follow. This may be because the Tour de France rewards race strategy: a strong team or country has a strategy better suited to the competition. Specifically, the financial backing of the larger teams allows them to attract top-tier riders, coaches and staff through the pay packets they can offer, and to afford state-of-the-art practices that assist performance, potentially increasing their dominance over smaller teams (“What Makes Ineos Unbeatable?” 2019). This suggests that a rider’s chance of winning the Tour de France may depend on the team they choose or are chosen for.
The following section analyses the incompletion rate across the different types of stage. Each stage type carries its own difficulties, and understanding where riders have historically shown the greatest weakness helps identify which stages should be targeted and where energy needs to be conserved. To conduct this analysis, the total number of riders who failed to complete each stage type was divided by the total number of riders who competed in that stage type over time. The stage_clean data set records riders who “did not finish”, were “over the time limit” or “not qualified” with a rank of “DNF”, “OTL” or “NQ” respectively. These ranks were tallied for every stage, as were the participants, and the first tally was divided by the second to give the percentage incompletion rate for each stage type.
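The calculation described above can be sketched in base R. The results below are made up, and the sketch assumes the stage type has already been joined onto each result row; the `stage_type` and `not_finished` column names are introduced here for illustration.

```r
# Made-up stage results; rank holds a finishing position or a status code.
results <- data.frame(
  stage_type = c("Flat", "Flat", "Flat", "Flat",
                 "Mountain", "Mountain", "Mountain", "Mountain"),
  rank = c("1", "2", "3", "DNF", "1", "DNF", "OTL", "2"),
  stringsAsFactors = FALSE
)

# Ranks that count as failing to complete the stage.
not_finished_codes <- c("DNF", "OTL", "NQ")
results$not_finished <- results$rank %in% not_finished_codes

# Rate = riders who did not complete / riders who took part, per stage type.
# The mean of a logical vector is the proportion of TRUE values.
rate <- aggregate(not_finished ~ stage_type, data = results, FUN = mean)
rate$pct <- 100 * rate$not_finished
```

In this toy example one of four Flat results and two of four Mountain results are incomplete, giving rates of 25% and 50%.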
Figure 3.7: The percentage of riders that have not finished each of the four main stage types, since 1969
Figure 3.7 shows the percentage of riders who failed to complete each stage type, with the stage type on the x axis and the percentage on the y axis. Mountain and Hilly stages prove to be the hardest terrain for riders to finish. The percentages of riders who failed to complete each stage type are:
Time trials have the lowest percentage of riders failing to complete, at 0.29%, with Flat stages at more than double that. Time trial stages are ridden individually, are usually shorter and are most often on flatter terrain; these characteristics make them safer, with reduced physical and collision risk. Their low incompletion rate suggests that time trials are vital stages to master: the safer conditions allow skilled riders to make the most of their abilities and use these stages to extend their time gap over competitors. Furthermore, because mountain and hilly stages show higher incompletion rates, tailoring training and strategy to these stages could benefit the outcome of a rider’s race. Given their greater historical difficulty, these are also stages where a rider can target the competition, making them a strength in order to gain positions when other riders are more likely to struggle.
The distance covered in the Tour de France through history
Figure 3.8: The distance of the Tour de France route (1903-2019)
The speed of the winners: how have developments in bicycle technology and doping affected it?
Figure 3.9: Speed of the winners through history (1903-2018)
As more editions of the Tour de France have been held and technology has developed, equipment has become more advanced and team strategy more mature, which may have led to more intense competition. Below, we analyse whether the Tour de France has become more competitive based on the data.
Figure 3.10: Average hours taken by the winner to complete the race
As shown in Figure 3.10, the average total time taken by the winner to complete the race increased before 1920 and peaked around 1920 at over two hundred hours. After 1920 it began to decline, and in recent decades it has dropped to less than ninety hours. However, this result alone is not enough to show that riders are getting faster or that the race is becoming more competitive, because the reduction in average total hours is also affected by the distance of the races.
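One way to separate the effect of route length from rider speed is to divide distance by total time. The sketch below uses the distance (km) and time_overall (hours) columns from Table 2.1, with placeholder values rather than figures from the data; `speed_kmh` is a name introduced here.

```r
# Placeholder distances and times (illustrative only, not from the data).
speed_demo <- data.frame(
  year         = c(1920, 2019),
  distance     = c(5500, 3400),  # km
  time_overall = c(220, 85)      # hours
)

# Average speed in km/h, which removes the effect of route length.
speed_demo$speed_kmh <- speed_demo$distance / speed_demo$time_overall
```

With these placeholder values the shorter, more recent race is also the faster one (40 km/h against 25 km/h), a comparison that total hours alone cannot show.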
Figure 3.11: Average difference in finishing time between the race winner and the runner up
Next we look at the trend in time margin (Figure 3.11), where the line represents the time margin averaged over decades. The time margin is clearly decreasing, which means the gap between the winner and the runner-up is getting smaller. By 2010 the time margin was already less than three minutes. This means that, regardless of race distance, the performance difference between riders has narrowed and the competition has become more intense: a slight mistake may cost a rider the chance of winning.
At the conclusion of our analysis, it is evident that a rider must possess a range of attributes and have strategies in place in order to win the general classification. Specifically, the analysis highlights the importance of targeting appropriate stages to attack the competition, training for the right terrain, and building the most efficient body to endure the gruelling 23-day tour.
Furthermore, as tour organisers introduced new trophies, the number of riders in contention for the yellow jersey has dwindled over the years as riders focus on specific terrains, making it more difficult for overall winners to lead individual stages. Each year, tour organisers vary the stages of the course, forcing riders to tailor their tactics to the different challenges of each course.
Overall winners do not necessarily win all stages, but they appear to be consistent in their ranks. In most cases, overall winners climb the ranks gradually, placing better than 40th in stages 1-7 and better than 20th from stage 8 onwards. Our analysis suggests that overall winners may start wearing the yellow jersey from stage 17 onwards.
Our analysis suggests that overall winners are all-rounders who deploy strategic tactics. On top of their physiological traits, overall winners are usually formidable mountain climbers. Additionally, an overall winner does not triumph without his team, as even the speediest rider may not outpace a group of cyclists. The team allows its leader to take advantage of aerodynamics to ride at higher speeds with less energy expended, preserving his energy for critical stages.
Mock, Thomas. 2020. “Tidy Tuesday Tour de France Dataset.” https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-04-07/readme.md.
Rushworth, Alastair. 2020. Tdf: Tour de France Data. https://alastairrushworth.github.io/tdf/.
Wikipedia contributors. 2020. “2018 Tour de France — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=2018_Tour_de_France&oldid=960595512.
Wikipedia contributors. 2020. “2019 Tour de France — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=2019_Tour_de_France&oldid=968463007.